Skip to content

fix: vector clustering key compaction task stuck in analyzing state#48529

Open
xiaocai2333 wants to merge 4 commits intomilvus-io:masterfrom
xiaocai2333:fix/vector-clustering-key-stuck-47540
Open

fix: vector clustering key compaction task stuck in analyzing state#48529
xiaocai2333 wants to merge 4 commits intomilvus-io:masterfrom
xiaocai2333:fix/vector-clustering-key-stuck-47540

Conversation

@xiaocai2333
Copy link
Contributor

Summary

  • Map analyzing state to InProgress in FromCompactionState so the global scheduler keeps the task in runningTasks instead of dropping it
  • Guard QueryTaskOnWorker to skip DataNode queries while the task is in analyzing state, preventing premature state reset

Root cause

When FloatVector is used as a clustering key, CreateTaskOnWorker calls doAnalyze() which sets the task state to analyzing. However, FromCompactionState did not map analyzing to any scheduler state, returning None. This caused the global scheduler to drop the task from both pendingTasks and runningTasks. After processAnalyzing() transitioned the state back to pipelining, no one re-scheduled doCompact(), leaving the task stuck forever.

Test plan

  • TestFromCompactionState — verifies all compaction states map correctly, including analyzing → InProgress
  • TestQueryTaskOnWorkerSkipAnalyzing — verifies QueryTaskOnWorker returns immediately when task is in analyzing state without querying DataNode

issue: #47540

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

…ilvus-io#47540)

When FloatVector is used as a clustering key, the compaction task gets
stuck because the analyzing state was not mapped in FromCompactionState,
causing the global scheduler to drop the task. After processAnalyzing
transitions the state back to pipelining, no one re-schedules doCompact.

Fix by mapping analyzing to InProgress so the scheduler keeps the task
in runningTasks, and guarding QueryTaskOnWorker to skip DataNode queries
while the task is still analyzing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xiaocai2333
To complete the pull request process, please assign xiaofan-luan after the PR has been reviewed.
You can assign the PR to them by writing /assign @xiaofan-luan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot added the size/M Denotes a PR that changes 30-99 lines. label Mar 26, 2026
@mergify mergify bot added dco-passed DCO check passed. kind/bug Issues or changes related a bug labels Mar 26, 2026
@sre-ci-robot
Copy link
Contributor

[ci-v2-notice]
Notice: New ci-v2 system is enabled for this PR.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-build-all // for ci-v2/build-all (multi-arch builds)
  • /ci-rerun-buildenv // for ci-v2/build-env (build milvus-env builder images)
  • /ci-rerun-ut-integration // for ci-v2/ut-integration, will rerun ci-v2/build
  • /ci-rerun-ut-go // for ci-v2/ut-go, will rerun ci-v2/build
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp, will rerun ci-v2/build
  • /ci-rerun-e2e-arm // for ci-v2/e2e-arm
  • /ci-rerun-e2e-default // for ci-v2/e2e-default
  • /ci-rerun-ciloop // for ci-v2/ciloop (build + unit tests in one pipeline)
  • /ci-rerun-gosdk // for ci-v2/go-sdk (Go SDK E2E tests, AMD)
  • /ci-rerun-gosdk-arm // for ci-v2/go-sdk-arm (Go SDK E2E tests, ARM)

If you have any questions or requests, please contact @zhikunyao.

…uest

The analyzeTask.CreateTaskOnWorker was sending an incomplete AnalyzeRequest
with empty SegmentStats, Dim=0, and missing clustering configuration params.
Port the logic from 2.5's PreCheck to populate segment binlog IDs, extract
vector dimension from schema TypeParams, calculate NumClusters, and set all
clustering compaction configuration parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
@sre-ci-robot sre-ci-robot added size/L Denotes a PR that changes 100-499 lines. and removed size/M Denotes a PR that changes 30-99 lines. labels Mar 26, 2026
…ycle

Verify the full state machine: pipelining → analyzing → pipelining
(with AnalyzeVersion) → executing. Covers the three fixes from milvus-io#47540:
- FromCompactionState(analyzing) returns InProgress
- QueryTaskOnWorker skips when analyzing
- CreateTaskOnWorker calls doCompact after AnalyzeVersion is set

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
@codecov
Copy link

codecov bot commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 51.51515% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.58%. Comparing base (1753251) to head (f24543d).
⚠️ Report is 22 commits behind head on master.

Files with missing lines Patch % Lines
internal/datacoord/task_analyze.go 44.82% 30 Missing and 2 partials ⚠️

❌ Your patch check has failed because the patch coverage (51.51%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #48529      +/-   ##
==========================================
- Coverage   77.60%   77.58%   -0.02%     
==========================================
  Files        2113     2111       -2     
  Lines      350935   350746     -189     
==========================================
- Hits       272352   272138     -214     
- Misses      70242    70268      +26     
+ Partials     8341     8340       -1     
Components Coverage Δ
Client 79.25% <ø> (ø)
Core 83.98% <ø> (ø)
Go 75.75% <51.51%> (-0.04%) ⬇️
Files with missing lines Coverage Δ
internal/datacoord/compaction_task_clustering.go 71.60% <100.00%> (+0.31%) ⬆️
pkg/taskcommon/state.go 16.88% <100.00%> (+16.88%) ⬆️
internal/datacoord/task_analyze.go 63.86% <44.82%> (-7.87%) ⬇️

... and 44 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sre-ci-robot sre-ci-robot added the low-code-coverage add test-label from zhikun, diff coverage > 80% label Mar 26, 2026
…us-io#47540)

Add end-to-end test that creates a collection with FloatVector as
clustering key, inserts data, triggers clustering compaction, and
verifies the full lifecycle completes (including the analyze step).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
@sre-ci-robot sre-ci-robot added area/test sig/testing test/integration integration test size/XL Denotes a PR that changes 500-999 lines. and removed size/L Denotes a PR that changes 100-499 lines. labels Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/test dco-passed DCO check passed. kind/bug Issues or changes related a bug low-code-coverage add test-label from zhikun, diff coverage > 80% sig/testing size/XL Denotes a PR that changes 500-999 lines. test/integration integration test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants